---
id: multimodal-metrics-g-eval
title: Multimodal G-Eval
sidebar_label: Multimodal G-Eval
---

<head>
  <link rel="canonical" href="https://deepeval.com/docs/metrics-llm-evals" />
</head>

import MetricTagsDisplayer from "@site/src/components/MetricTagsDisplayer";

<MetricTagsDisplayer custom={true} multimodal={true} />

The multimodal G-Eval is an adapted version of `deepeval`'s popular [`GEval` metric](/docs/metrics-llm-evals) for evaluating multimodal LLM interactions.

It is currently the best way to define custom criteria to evaluate text + images in `deepeval`. By defining a custom `MultimodalGEval`, you can determine, for example, how well your MLLMs generate, edit, and reference images.

## Required Arguments

To use the `MultimodalGEval`, you'll have to provide the following arguments when creating an [`MLLMTestCase`](/docs/evaluation-test-cases#mllm-test-case):

- `input`
- `actual_output`

You'll also need to supply any additional arguments such as `expected_output` and `context` if your evaluation criteria depend on these parameters.

:::tip
The `input` and `actual_output` of an `MLLMTestCase` are each a list of strings and/or `MLLMImage` objects.
:::

## Usage

To create a custom metric that uses MLLMs for evaluation, simply instantiate a `MultimodalGEval` class and **define your evaluation criteria in everyday language**:

```python
from deepeval.metrics import MultimodalGEval
from deepeval.test_case import MLLMTestCaseParams, MLLMTestCase, MLLMImage

m_test_case = MLLMTestCase(
    input=["Show me how to fold an airplane"],
    actual_output=[
        "1. Take the sheet of paper and fold it lengthwise",
        MLLMImage(url="./paper_plane_1", local=True),
        "2. Unfold the paper. Fold the top left and right corners towards the center.",
        MLLMImage(url="./paper_plane_2", local=True)
    ]
)
text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    criteria="Determine whether the images and text in the actual output are coherent.",
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)

text_image_coherence.measure(m_test_case)
print(text_image_coherence.score, text_image_coherence.reason)
```

There are **THREE** mandatory and **SEVEN** optional parameters when instantiating a `MultimodalGEval` class:

- `name`: name of custom metric.
- `criteria`: a description outlining the specific evaluation aspects for each test case.
- `evaluation_params`: a list of type `MLLMTestCaseParams`. Include only the parameters that are relevant for evaluation.
- [Optional] `evaluation_steps`: a list of strings outlining the exact steps the MLLM should take for evaluation. If `evaluation_steps` is not provided, `MultimodalGEval` will generate a series of `evaluation_steps` on your behalf based on the provided `criteria`.
- [Optional] `rubric`: a list of `Rubric`s that allows you to [confine the range](/docs/metrics-llm-evals#rubric) of the final metric score.
- [Optional] `threshold`: the passing threshold, defaulted to 0.5.
- [Optional] `model`: a string specifying which of OpenAI's GPT models to use, **OR** [any custom LLM model](/docs/metrics-introduction#using-a-custom-llm) of type `DeepEvalBaseLLM`. Defaulted to 'gpt-4o'.
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-metrics-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
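
As a rough illustration of how `threshold` and `strict_mode` interact, the sketch below reproduces the behavior described in the list above in plain Python. It is not `deepeval`'s implementation: the helper name `apply_threshold` is hypothetical, and it assumes a score normalized to the 0 to 1 range that passes when it is greater than or equal to the threshold.

```python
from typing import Tuple


def apply_threshold(
    score: float, threshold: float = 0.5, strict_mode: bool = False
) -> Tuple[float, bool]:
    """Hypothetical helper: return the (possibly binarized) score and pass/fail.

    Assumes `score` is already normalized to the range [0, 1].
    """
    if strict_mode:
        # Strict mode overrides any user-supplied threshold and sets it to 1 ...
        threshold = 1.0
        # ... and collapses the score to binary: 1 for perfection, 0 otherwise.
        score = 1.0 if score >= 1.0 else 0.0
    # Assumption: a metric passes when its score meets or exceeds the threshold.
    return score, score >= threshold


print(apply_threshold(0.8))                    # (0.8, True)
print(apply_threshold(0.8, strict_mode=True))  # (0.0, False)
print(apply_threshold(1.0, strict_mode=True))  # (1.0, True)
```

In other words, a score of 0.8 passes under the default threshold of 0.5 but fails once `strict_mode` is enabled.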

:::danger
For accurate and valid results, only the parameters that are mentioned in `criteria`/`evaluation_steps` should be included as a member of `evaluation_params`.
:::

### Evaluation Steps

Similar to regular [`GEval`](/docs/metrics-llm-evals), providing `evaluation_steps` tells `MultimodalGEval` to follow your steps for evaluation instead of first generating them from `criteria`, which makes metric scores more controllable:

```python
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    evaluation_steps=[
        "Evaluate whether the images and the accompanying text in the actual output logically match and support each other.",
        "Check if the visual elements (images) enhance or contradict the meaning conveyed by the text.",
        "If there is a lack of coherence, identify where and how the text and images diverge or create confusion.",
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

### Rubric

You can also provide `Rubric`s through the `rubric` argument to confine your evaluation MLLM to output in specific score ranges:

```python
from deepeval.metrics.g_eval import Rubric
...

text_image_coherence = MultimodalGEval(
    name="Text-Image Coherence",
    # Same criteria as in the usage example above
    criteria="Determine whether the images and text in the actual output are coherent.",
    rubric=[
        Rubric(score_range=(1, 3), expected_outcome="Text and image are incoherent or conflicting."),
        Rubric(score_range=(4, 7), expected_outcome="Partial coherence with some mismatches."),
        Rubric(score_range=(8, 10), expected_outcome="Text and image are clearly coherent and aligned."),
    ],
    evaluation_params=[MLLMTestCaseParams.ACTUAL_OUTPUT],
)
```

Note that `score_range` ranges from **0 - 10, inclusive** and different `Rubric`s must not have overlapping `score_range`s. You can also specify `score_range`s where the start and end values are the same to represent a single score.

:::tip
The `rubric` argument is an optional improvement made by `deepeval` on top of the original implementation in the G-Eval paper.
:::
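
To make the `score_range` constraints concrete, here is a small standalone sketch that checks a set of ranges against the rules stated above (each range within 0 to 10 inclusive, no two ranges sharing a score, single-score ranges allowed). The function `validate_score_ranges` is a hypothetical helper written for illustration; it is not part of `deepeval`, whose own validation may differ.

```python
from typing import List, Tuple


def validate_score_ranges(ranges: List[Tuple[int, int]]) -> None:
    """Hypothetical check: raise ValueError if the ranges break the rubric rules."""
    ordered = sorted(ranges)  # sort by start value (then end value)

    for start, end in ordered:
        # Each range must lie within 0-10 inclusive; start == end is allowed.
        if not (0 <= start <= end <= 10):
            raise ValueError(f"score_range {(start, end)} must lie within 0-10")

    # After sorting, any overlap shows up between neighbouring ranges.
    for (prev_start, prev_end), (next_start, next_end) in zip(ordered, ordered[1:]):
        # Ranges are inclusive, so sharing an endpoint counts as an overlap.
        if next_start <= prev_end:
            raise ValueError(
                f"score_range {(prev_start, prev_end)} overlaps {(next_start, next_end)}"
            )


# The ranges from the rubric example above are valid.
validate_score_ranges([(1, 3), (4, 7), (8, 10)])

# A single-score range such as (10, 10) is also fine.
validate_score_ranges([(0, 4), (5, 9), (10, 10)])
```

Because the ranges are inclusive, a pair such as `(1, 3)` and `(3, 5)` would be rejected by this check, since both claim the score 3.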

## How Is It Calculated?

The `MultimodalGEval` is an adapted version of [`GEval`](/docs/metrics-llm-evals). Like `GEval`, it is a two-step algorithm: it first generates a series of `evaluation_steps` using chain of thought (CoT) based on the given `criteria`, then uses the generated `evaluation_steps` to determine the final score from the `evaluation_params` provided through the `MLLMTestCase`.

Unlike regular `GEval` though, the `MultimodalGEval` takes images into consideration as well.

:::tip
Similar to the original [G-Eval paper](https://arxiv.org/abs/2303.16634), the `MultimodalGEval` metric uses the probabilities of the LLM output tokens to normalize the score by calculating a weighted summation. This step was introduced in the paper to minimize bias in LLM scoring, and is automatically handled by `deepeval` (unless you're using a custom LLM).
:::
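
The weighted summation mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea from the G-Eval paper, not `deepeval`'s code: the function name `weighted_score` and the input shape (a mapping from candidate integer scores to their token log-probabilities) are assumptions made for this example.

```python
import math
from typing import Dict


def weighted_score(score_logprobs: Dict[int, float]) -> float:
    """Hypothetical sketch: probability-weighted average of candidate scores.

    `score_logprobs` maps each candidate score token (e.g. 7, 8, 9) to the
    log-probability the model assigned to it.
    """
    # Convert log-probabilities back to probabilities.
    probs = {score: math.exp(logprob) for score, logprob in score_logprobs.items()}
    total = sum(probs.values())
    if total == 0:
        raise ValueError("No probability mass on any candidate score")
    # Normalize so the weights sum to 1, then take the weighted sum.
    return sum(score * p for score, p in probs.items()) / total


# The model is fairly sure about 8, with some mass on 7 and 9.
example = {7: math.log(0.1), 8: math.log(0.6), 9: math.log(0.3)}
print(round(weighted_score(example), 2))  # 8.2
```

Compared with taking only the single most likely score (8 here), the weighted sum yields a finer-grained value that reflects the model's uncertainty.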
